Speech Emotion Recognition System

Introduction

Speech Emotion Recognition (SER) is the task of recognizing the emotional content of speech independently of its semantic content. The ability to express emotions is one of the defining aspects of sentient beings. While humans can efficiently pick up non-verbal cues within a conversation, the ability of computers to do the same has been an ongoing subject of research. Robots capable of understanding emotions could provide appropriate emotional responses and exhibit emotional personalities. In some circumstances, humans could be replaced by computer-generated characters able to conduct natural, convincing conversations by appealing to human emotions. For an entirely meaningful dialogue based on mutual human-machine trust and understanding, machines must understand the emotions conveyed by speech. Emotion recognition systems therefore aim to provide efficient, real-time methods of detecting the emotions of mobile phone users, call-center operators and customers, car drivers, pilots, and many other users of human-machine interfaces. Adding emotions to machines has been recognized as a critical factor in making machines appear and act human-like.

Scope and Purpose

Human emotions can be detected from various channels, such as speech, body language, facial expressions, and text. The most obvious channel is speech: just as body language and text can be studied to infer user sentiment, speech signals carry a wealth of information related to emotional characteristics. While this is an easy task for humans, computers still have a long way to go before emotion recognition becomes a mature form of artificial intelligence. The biggest impediment is that no single discrete speech feature correlates directly with the speaker's emotions. Added challenges are the limited training data available and the resulting low prediction accuracy. Understanding human emotions is critical in areas such as psychology, criminology, banking, and insurance, predominantly to aid appropriate corrective action in cases of crime, fraud, etc. In these contexts, emotions in speech can be used to infer various facets of human behaviour irrespective of language, ethnicity, and other distinguishing factors. A model that identifies emotions from speech can be integrated into such platforms to aid decision making.

Audio Preprocessing

The data is an audio file, and features can be extracted from it in multiple ways. The sound excerpts are digital audio files in .wav format. Sampling is the process of digitizing a continuous sound wave into a series of discrete values.

Since the audio files are saved in .wav format, they are easy to load with Librosa or a similar library such as Torchaudio.

The speech files, in .wav format, are digitized using the Librosa/Torchaudio library. The digitized samples are trimmed to remove leading and trailing silences, then zero-padded to a consistent length for further processing. The signal is then transformed to the frequency domain using the short-time Fourier transform, which makes the audio amenable to feature extraction.
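The trim, pad, and transform steps can be sketched in plain NumPy (a minimal illustration of the pipeline only; the energy threshold and frame parameters here are assumptions, and the actual scripts use Librosa's trim and stft):

```python
import numpy as np

def trim_silence(signal, threshold=1e-3):
    """Remove leading and trailing samples whose magnitude is below threshold."""
    voiced = np.flatnonzero(np.abs(signal) > threshold)
    if voiced.size == 0:
        return signal[:0]
    return signal[voiced[0]:voiced[-1] + 1]

def pad_to_length(signal, target_len):
    """Zero-pad (or truncate) so every sample has the same length."""
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))

def stft(signal, n_fft=1024, hop=256):
    """Short-time Fourier transform: magnitude spectrum per Hann-windowed frame."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (n_fft//2 + 1, n_frames)

# Toy clip: silence, a 1 s-ish tone, silence again.
audio = np.concatenate([np.zeros(1000), np.sin(np.linspace(0, 100, 8000)), np.zeros(1000)])
trimmed = trim_silence(audio)
padded = pad_to_length(trimmed, 8000)
spec = stft(padded)
print(spec.shape)  # (513, 28)
```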

Feature Extraction:

Spectrograms and Mel-Spectrograms:

The Mel-spectrogram is a visual representation of temporal changes in the energy of different frequency bands. Humans do not perceive frequency on a linear scale. The Mel scale addresses this by adopting a non-linear scale on which equal distances between frequencies sound equally far apart to human ears. These features lend themselves well to deep learning models. The Mel-spectrogram of an audio file from the RAVDESS database after the pre-processing steps is shown in Fig. 2.1.a.

Fig [2.1.a]
The main difference between a Mel-spectrogram and a spectrogram is the frequency axis: a spectrogram uses a linearly spaced frequency scale (each frequency bin is an equal number of Hertz apart), whereas a Mel-spectrogram uses the Mel scale.
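The non-linear mapping can be illustrated with the commonly used HTK formula m = 2595 · log10(1 + f/700) (one of several Mel-scale conventions; Librosa's default is the Slaney variant, so the exact numbers differ):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to Mel using the HTK formula."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal 1000 Hz steps shrink on the Mel axis as frequency grows:
for f in (1000, 2000, 3000, 4000):
    print(f, round(hz_to_mel(f), 1))
```

Successive 1000 Hz steps cover progressively fewer Mels, which is exactly why high-frequency detail is compressed in a Mel-spectrogram.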

MFCCs:

In the context of speech, MFCCs are among the most commonly used features and contain the “voice fingerprint” of speech. While the proposed methodology explores MFCCs as features for emotion recognition, it also looks at using Mel-spectrograms as features.

Fig [2.1.b]
Computing the MFCCs involves taking the logarithm of the Mel-spectrogram followed by a Discrete Cosine Transform (DCT). MFCCs are a compressed representation of the Mel-spectrogram, offering a computationally less intensive solution. Of the 40 MFCC coefficients computed, the first 13 are the most important for capturing the frequency changes in the Mel-spectrogram. Fig. 2.1.b shows the MFCCs computed from the Mel-spectrogram (Fig. 2.1.a).
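The log-then-DCT step can be sketched in NumPy (an orthonormal DCT-II over the Mel axis, written out explicitly for illustration; in practice librosa.feature.mfcc performs this computation):

```python
import numpy as np

def dct_ii(x, n_out):
    """Orthonormal DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]            # output coefficient index
    m = np.arange(n)[None, :]                # input Mel-band index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)                 # orthonormal scaling for k = 0
    return basis @ x

# A toy "Mel-spectrogram": 128 Mel bands x 100 frames of random power values.
rng = np.random.default_rng(0)
mel_spec = rng.random((128, 100)) + 1e-6
log_mel = 10.0 * np.log10(mel_spec)          # log compression (dB-like)
mfcc = dct_ii(log_mel, n_out=40)             # keep 40 coefficients
print(mfcc.shape)  # (40, 100)
```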

Wav2Vec2 Feature Extractor

A feature extractor is in charge of preparing input features for a model. This includes feature extraction from sequences (e.g., pre-processing audio files into log-Mel spectrogram features), feature extraction from images (e.g., cropping image files), as well as padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.

In Hugging Face Transformers, the Wav2Vec2 model is thus accompanied by both a tokenizer, called Wav2Vec2CTCTokenizer, and a feature extractor, called Wav2Vec2FeatureExtractor.

An audio file usually stores both its sample values and the sampling rate with which the speech signal was digitized. We want to store both in the dataset and write a map(...) function accordingly.

wav2vec2_structure

The pretrained Wav2Vec2 checkpoint maps the speech signal to a sequence of context representations, as illustrated in the figure above. A fine-tuned Wav2Vec2 checkpoint needs to map this sequence of context representations to its corresponding transcription, so a linear layer has to be added on top of the transformer block (shown in yellow).

Data augmentation

The most popular speech dataset considered (RAVDESS) has a total of 1440 data points covering a set of 8 emotions. The data was recorded in a studio environment to ensure no noise, with speakers having a North American accent. The other dataset, CREMA-D, is another English-language corpus recorded by a larger demographic of speakers with a variety of accents; it has 7442 data points in total. The total data available for building a robust speech emotion recognition system is too small for state-of-the-art deep learning models: neural networks now support parameters on the order of millions, and obtaining reliable performance requires feeding in a proportionally large amount of data. Therefore, augmentation techniques are applied before training a deep learning model on the dataset, and the augmented data is then used for analysis and processing. For a given spectrogram, the x-axis represents time and the y-axis represents frequency. Of the several augmentation techniques available, we focus on the following:

Windowing:

Speech signals are time-variant in nature, so to extract information a signal is broken down into shorter temporal segments. The data is sampled at 8000 Hz, and the audio files are approximately 3 seconds long (~1.5 seconds after snipping the leading and trailing silence). The frame size of each window is set to 0.5 s, and the hop length is 0.25 s.

Fig [3.1.a] Original Spectrogram
  • Case 1 : The frame size of each window is set to 1 s
    Fig [3.1.b] Example of windowing a signal with a 1 s frame size and 0.25 s hop.
  • Case 2 : The frame size of each window is set to ~0.5 s

    Fig [3.1.c] Example of windowing a signal with a 0.5 s frame size and 0.25 s hop.
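Windowing at the 8 kHz sampling rate can be sketched as follows (frame and hop sizes taken from the cases above; a minimal illustration, not the project's actual script):

```python
import numpy as np

SR = 8000  # sampling rate in Hz

def window_signal(signal, frame_s=0.5, hop_s=0.25, sr=SR):
    """Split a 1-D signal into overlapping frames of frame_s seconds every hop_s seconds."""
    frame = int(frame_s * sr)
    hop = int(hop_s * sr)
    return np.stack([signal[i:i + frame]
                     for i in range(0, len(signal) - frame + 1, hop)])

audio = np.random.default_rng(0).standard_normal(int(1.5 * SR))  # ~1.5 s clip
windows = window_signal(audio)
print(windows.shape)  # (5, 4000): five half-second windows at a 0.25 s hop
```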

SpecAugmentation:

The SpecAugment paper discusses three representative data augmentation techniques:

  • 1) Frequency masking: f consecutive frequency channels [f0, f0 + f) are masked. f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f), where ν is the number of frequency channels.

    Fig [3.2.a] Example of applying frequency masking to the spectrogram
  • 2) Time masking: t consecutive time steps [t0, t0 + t) are masked. t is chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t).

    Fig [3.2.b] Example of applying time masking to the spectrogram
  • 3) Time warping

A random point is chosen and warped to either the left or the right by a distance w, chosen from a uniform distribution from 0 to the time warp parameter W along the time axis. As the paper reports that time warping did not improve model performance, the experiments conducted here consider only frequency and time masking.

  • 4) Combining frequency and time masking

    Fig [3.2.c] Example of applying frequency and time masking together on the spectrogram
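Frequency and time masking can be sketched directly on a spectrogram array (a minimal NumPy version of the masks described above; torchaudio's FrequencyMasking and TimeMasking transforms provide equivalent operations):

```python
import numpy as np

def mask_axis(spec, max_width, axis, rng):
    """Zero out a random contiguous band of up to max_width bins along the given axis."""
    size = spec.shape[axis]
    width = rng.integers(0, max_width + 1)       # mask width ~ U[0, max_width]
    start = rng.integers(0, size - width + 1)    # mask start position
    masked = spec.copy()
    sl = [slice(None)] * spec.ndim
    sl[axis] = slice(start, start + width)
    masked[tuple(sl)] = 0.0
    return masked

rng = np.random.default_rng(42)
spec = np.ones((128, 100))                                          # toy spectrogram: freq x time
freq_masked = mask_axis(spec, max_width=27, axis=0, rng=rng)        # frequency mask
both_masked = mask_axis(freq_masked, max_width=40, axis=1, rng=rng)  # plus a time mask
print(freq_masked.shape, both_masked.shape)
```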

Environment

The models were trained in Jupyter notebooks with an Anaconda Python environment. Reproducing the results requires installing the necessary packages and training the models.

  • The environments used for SER are in the folder : SER/Environment

The Anaconda Python environment in which the model was trained is exported to a YML file. The environment can be set up on a new computer by running the following command:

  • conda env create -f transformers_4p8_env.yml


Running this command installs the transformers_4p8_env environment in the default conda environment path.

To specify an install path different from your system's default (not related to 'prefix' in the environment.yml), use the -p flag followed by the required path:

conda env create -f environment.yml -p /home/user/anaconda3/envs/env_name

Datasets:

The emotion corpora that have been annotated and evaluated to date are:

  • Location of Datasets : The emotion datasets mentioned below are located in _SER/EmotionData

RAVDESS

Dataset location: _EmotionData/Datasets/RAVDESS

Data Description

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a widely used database for emotion recognition. The dataset contains 24 professional actors (12 female, 12 male) who vocalize two lexically matched statements in 8 emotions (neutral, calm, happy, sad, angry, fearful, disgust, and surprised).

All speakers in this database have a North American accent. The modalities covered in the database are: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video-only. There are 1440 stimuli in total. The file naming conventions used for pre-processing the data are referenced in [xx].

In the experiments, only the audio speech data is used to test the performance of the model as the aim is to acquire emotions from spoken audio.

Fig [4.a] Fig [4.b]

Figures [4.1.a] - [4.1.d] show the distribution of the RAVDESS audio data. Fig [4.1.a] is the distribution of emotions in the RAVDESS data. There are 8 emotions in total; all but neutral are balanced in the dataset. Fig [4.1.b] shows the distribution of the actors' gender in the RAVDESS dataset, with the male and female samples being balanced.

Fig [4.1.c] Fig [4.1.d]

Fig [4.1.c] shows the distribution of emotion intensity; there are fewer strong-intensity samples in this dataset because there is no strong intensity for the neutral emotion. Fig [4.1.d] shows the distribution of the two statements used throughout the RAVDESS data, which is balanced.

Audio Pre-processing and Feature Extraction

Scripts :

In [13]:
import librosa,librosa.display
import IPython.display as ipd
import numpy as np
import os
def melspectrogram(audio, sample_rate):
    # 128-band Mel-spectrogram from a Hamming-windowed STFT
    mel_spec = librosa.feature.melspectrogram(y=audio,
                                              n_fft=1024,
                                              win_length=512,
                                              window='hamming',
                                              hop_length=100,
                                              n_mels=128,
                                              fmax=sample_rate/2
                                             )
    # convert power values to decibels relative to the peak
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    return mel_spec_db

def load_audio(aud_loc, SR):
    # load at the target sampling rate, then trim leading/trailing silence below 30 dB
    audio, sample_rate = librosa.load(aud_loc, offset=0, sr=SR)
    trimmed, idx_trimmed = librosa.effects.trim(audio, top_db=30)
    mel_spectrogram = melspectrogram(trimmed, SR)
    librosa.display.specshow(mel_spectrogram, y_axis='mel', x_axis='time')
    print('MEL spectrogram shape: ', mel_spectrogram.shape)
    print('The Original Audio can be found in path :\n {}'.format(aud_loc))
    return ipd.Audio(trimmed, rate=SR)
In [14]:
audio_loc= 'Emotion_Data/Datasets/RAVDESS/Audio_Speech_Actors_01-24/Actor_01/03-01-01-01-01-02-01.wav'
load_audio(audio_loc,SR=8000)
MEL spectrogram shape:  (128, 118)
The Original Audio can be found in path :
 Emotion_Data/Datasets/RAVDESS/Audio_Speech_Actors_01-24/Actor_01/03-01-01-01-01-02-01.wav
Out[14]:

Data Annotations

File naming convention for the RAVDESS dataset:

Each RAVDESS file has a unique filename consisting of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

Filename identifiers :

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
  • Vocal channel (01 = speech, 02 = song).
  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the ‘neutral’ emotion.
  • Statement (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”).
  • Repetition (01 = 1st repetition, 02 = 2nd repetition).
  • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
    Filename example: 02-01-06-01-02-01-12.mp4

    • Video-only (02)
    • Speech (01)
    • Fearful (06)
    • Normal intensity (01)
    • Statement “dogs” (02)
    • 1st Repetition (01)
    • 12th Actor (12)
    • Female, as the actor ID number is even.
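The identifiers above can be decoded with a small helper (a hypothetical parser written here for illustration; the field order and codes are exactly as listed above):

```python
# Decode a RAVDESS filename into its stimulus characteristics.
EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}
MODALITIES = {'01': 'full-AV', '02': 'video-only', '03': 'audio-only'}

def parse_ravdess(filename):
    stem = filename.rsplit('.', 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split('-')
    return {
        'modality': MODALITIES[modality],
        'vocal_channel': 'speech' if channel == '01' else 'song',
        'emotion': EMOTIONS[emotion],
        'intensity': 'normal' if intensity == '01' else 'strong',
        'statement': statement,
        'repetition': int(repetition),
        'actor': int(actor),
        'gender': 'male' if int(actor) % 2 == 1 else 'female',
    }

info = parse_ravdess('02-01-06-01-02-01-12.mp4')
print(info['emotion'], info['gender'])  # fearful female
```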

TESS

Dataset location: _EmotionData/Datasets/TESS

Data Description

The Toronto Emotional Speech Set (TESS) consists of the phrase “Say the word” followed by each of 200 target words, spoken by two female actors: the younger aged 24 years and the older aged 64 years. The recordings portray each of 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 stimuli in total.

Data Distribution

In the experiments, the audio speech and transcripts are used to test the performance of the model, as the aim is to acquire emotions either individually or through fusion (audio and text).

Fig [4.2.a] Fig [4.2.b]
Fig [4.2.a] shows the distribution of data based on the actors' ages; the young and old female actors' samples are balanced.
Fig [4.2.b] depicts the distribution of emotions in the TESS data. There are 7 emotions in total, all balanced with 400 stimuli each.

Pre-processing

Scripts :

In [15]:
audio_loc= 'Emotion_Data/Datasets/TESS/OAF_Fear/OAF_back_fear.wav'
load_audio(audio_loc,SR=8000)
MEL spectrogram shape:  (128, 134)
The Original Audio can be found in path :
 Emotion_Data/Datasets/TESS/OAF_Fear/OAF_back_fear.wav
Out[15]:

Data Annotations

Scripts :

The file naming convention for TESS is straightforward. The files are ordered by emotion, as seen below in Fig [4.2.c].

Fig [4.2.c] Fig [4.2.d]
Fig [4.2.d] the file naming convention for TESS is based on the emotion. The same words are uttered with different emotions, so there is no benefit in using the text transcripts for this dataset.

Fig [4.2.e] the data is loaded into a dataframe. The dataframe contains essential information such as the path and emotion.

Audio Feature Extraction:

The Mel-spectrogram for the TESS files are saved in the directory :
_Emotion_Data/Datasets/TESSMEL/
Example of an audio file:

Fig [4.2.f] An extracted Mel-spectrogram from TESS dataset stored as an image

Models with the TESS dataset and their Results:

IEMOCAP

Data Description

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session, for a total of 302 videos across the dataset. Each segment is annotated for the presence of up to 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral). The dataset also contains other essential markers of emotion such as valence, arousal, and dominance. The data is recorded across 5 sessions with 5 pairs of speakers.

Some studies on emotion recognition consider emotions on a 3-dimensional scale where valence, arousal, and dominance come into emphasis. However, the experiments conducted here treat each emotion as a discrete value. All speakers in this database have a North American accent, and only the audio modality (.wav) is used.

Data Distribution

In the experiments,the audio speech and transcripts are used to test the performance of the model as the aim is to acquire emotions either induvidually / through fusion (Audio and Text).

Fig [4.3.a] Fig [4.3.b]

Figures [4.3.a] - [4.3.b] show the distribution of the IEMOCAP data. Fig [4.3.a] is the distribution of emotions in the IEMOCAP data before removing the under-represented emotions. There are 9 emotions in total, and it can be clearly seen that they are not balanced: there are 1600+ data points for neutral and 1000+ for angry, sad, and excited, while happy has around 100. Training a deep learning model with this imbalance will bias the model towards the emotions with the most data points (e.g., neutral). Therefore, only the following emotions are used for training: happy and excited, neutral, angry, and sad; the remaining emotions are marked as "others", as seen in Fig [xx]. Fig [4.3.b] shows the resulting distribution of emotions in the IEMOCAP dataset; only happy and excited, neutral, angry, and sad are used in the experiments.

Fig [4.3.c] Fig [4.3.d]


Pre-processing

Audio-preprocessing

Scripts :

The IEMOCAP dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. The video data, consisting of facial landmarks, is not regarded in the experiments.

The dataset is spread across sessions (Session 1 - Session 5), each containing the transcripts and audio utterances for entire conversations. The pre-processing scripts segment each full utterance by speaker, using the transcripts, which contain speaker information such as start time, stop time, dialogue, and emotion.
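Segmenting a long recording by the start/stop times in a transcript can be sketched as follows (a hypothetical minimal version for illustration; the actual scripts parse the IEMOCAP transcript files):

```python
import numpy as np

SR = 8000  # sampling rate used in the experiments

def segment_utterance(audio, turns, sr=SR):
    """Cut a full-dialogue waveform into per-speaker sentence clips.

    turns: list of (start_s, stop_s, speaker, text) tuples from the transcript.
    """
    clips = []
    for start_s, stop_s, speaker, text in turns:
        clip = audio[int(start_s * sr):int(stop_s * sr)]
        clips.append({'speaker': speaker, 'text': text, 'audio': clip})
    return clips

# Toy example: a 10 s "dialogue" with two speaker turns.
dialogue = np.zeros(10 * SR)
turns = [(0.5, 3.0, 'F', 'first sentence'),
         (3.2, 9.8, 'M', 'second sentence')]
clips = segment_utterance(dialogue, turns)
print(len(clips), len(clips[0]['audio']))  # 2 20000
```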

The file naming conventions used for pre-processing the data are referenced in [xx]. Fig [xx] below shows an example of the audio before and after pre-processing. The pre-processing scripts, located in DIR, segment the whole utterance in DIR into sentences.

Fig [4.3.e]
Fig [4.3.f]
In [16]:
audio_loc_utt= 'Emotion_Data/IEMOCAP_full_release/Session5/dialog/wav/Ses05F_impro04.wav'
load_audio(audio_loc_utt,SR=8000)
MEL spectrogram shape:  (128, 32272)
The Original Audio can be found in path :
 Emotion_Data/IEMOCAP_full_release/Session5/dialog/wav/Ses05F_impro04.wav
Out[16]:
In [17]:
audio_loc_sent= 'Emotion_Data/IEMOCAP_full_release/Session5/sentences/wav/Ses05F_impro04/Ses05F_impro04_F003.wav'
load_audio(audio_loc_sent,SR=8000)
MEL spectrogram shape:  (128, 332)
The Original Audio can be found in path :
 Emotion_Data/IEMOCAP_full_release/Session5/sentences/wav/Ses05F_impro04/Ses05F_impro04_F003.wav
Out[17]:

Transcripts and Text Preprocessing

Scripts :

The IEMOCAP dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. The transcripts contain essential speaker information such as start time, stop time, dialogue, and emotion.

The dialogues are sent to the text model, and the start and stop times are used to segment the audio.

Fig [4.3.g]

Fig [4.3.g] shows the information included in the transcripts for a sample utterance, Ses05_impo04.

Data Extraction and Annotations

Scripts :

The scripts referenced below search through the transcripts directory and gather information on the number of files available. They then generate the files transcripts.csv and processed_tran.csv. (DIR + IEMOCAP/Audio/processed_tran.csv)

_Shreyah_code/IEMOCAP/Audio/Preprocessingscript.ipynb. Running the script produces the files processed_tran.csv and processed_label.txt, which contain the SessionID, dialogue, and the labels, respectively.

Fig [4.3.h]

Similarly, by summing the number of files in all of the sessions from the directory /Emoevaluation, we can confirm that the overall utterance count matches the transcripts.

The sentences directory holds the segmented utterances; summing them across all sessions confirms that the dialogues match the transcripts, as seen below. After summing all the sentences for all sessions, there are around 10000 sentences.

Fig [4.3.i]

The processed_tran.csv and processed_label.txt files are then merged into a dataframe, shown below. The fields contain the Session_ID used to access the path to the segmented wav file, the emotion, and the text.

Fig [4.3.j]

Text Pre-processing

The text pre-processing involves the same steps of merging processed_tran.csv and processed_label.txt into a dataframe. The location of these files is in DIR :

CREMA-D

(Crowd-sourced Emotional Multimodal Actors Dataset)

License: This Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

Data Description

CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).

Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad) and four different emotion levels (Low, Medium, High and Unspecified).

Participants rated the emotion and emotion levels based on the combined audiovisual presentation, the video alone, and the audio alone. Due to the large number of ratings needed, this effort was crowd-sourced and a total of 2443 participants each rated 90 unique clips, 30 audio, 30 visual, and 30 audio-visual. 95% of the clips have more than 7 ratings.

The description below specifies the data made available in this repository.

For a more complete description of how CREMA-D was created, use this link or the link below to the paper.

Audio Files (MP3): audio files used for presentation to the raters are stored in the AudioMP3 directory.

Processed Audio (WAV): audio files converted from the original video into a format appropriate for computational audio processing are stored in the AudioWAV directory.

Data Distribution

In the experiments, the audio speech and transcripts are used to test the performance of the model, as the aim is to acquire emotions either individually or through fusion (audio and text).

Fig [4.3.b]

Fig [4.3.b] depicts the distribution of emotions in the CREMA-D data. There are 6 emotions in total, almost all balanced with around 1200 stimuli each.

Pre-processing

Scripts :

In [18]:
audio_loc= 'Emotion_Data/cremad/AudioWAV/1091_WSI_NEU_XX.wav'
load_audio(audio_loc,SR=8000)
MEL spectrogram shape:  (128, 187)
The Original Audio can be found in path :
 Emotion_Data/cremad/AudioWAV/1091_WSI_NEU_XX.wav
Out[18]:

Data Annotations

The file naming convention for CREMA-D: the actor ID is a 4-digit number at the start of the filename, and each subsequent identifier is separated by an underscore (_).

Actors spoke from a selection of 12 sentences (in parentheses is the three letter acronym used in the second part of the filename):

  • It's eleven o'clock (IEO).
  • That is exactly what happened (TIE).
  • I'm on my way to the meeting (IOM).
  • I wonder what this is about (IWW).
  • The airplane is almost full (TAI).
  • Maybe tomorrow it will be cold (MTI).
  • I would like a new alarm clock (IWL)
  • I think I have a doctor's appointment (ITH).
  • Don't forget a jacket (DFA).
  • I think I've seen this before (ITS).
  • The surface is slick (TSI).
  • We'll stop in a couple of minutes (WSI).

The sentences were presented using different emotion (in parentheses is the three letter code used in the third part of the filename):

  • Anger (ANG)
  • Disgust (DIS)
  • Fear (FEA)
  • Happy/Joy (HAP)
  • Neutral (NEU)
  • Sad (SAD)
    Emotion level (in parentheses is the two letter code used in the fourth part of the filename):

  • Low (LO)

  • Medium (MD)
  • High (HI)
  • Unspecified (XX)
    The suffix of the filename is based on the type of file: flv for the flash video used for presentation of both the video-only and the audio-visual clips; mp3 for the audio files used for the audio-only presentation of the clips; wav for the files used for computational audio processing.

Fig [4.4.a]
Fig [4.4.a] represents the file structure and naming convention in CREMA-D.
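The naming convention above can be decoded with a small helper (a hypothetical parser written for illustration, using the codes listed above):

```python
# Decode a CREMA-D filename (ActorID_Sentence_Emotion_Level).
EMOTIONS = {'ANG': 'anger', 'DIS': 'disgust', 'FEA': 'fear',
            'HAP': 'happy', 'NEU': 'neutral', 'SAD': 'sad'}
LEVELS = {'LO': 'low', 'MD': 'medium', 'HI': 'high', 'XX': 'unspecified'}

def parse_cremad(filename):
    stem = filename.rsplit('.', 1)[0]
    actor, sentence, emotion, level = stem.split('_')
    return {'actor': int(actor),
            'sentence': sentence,
            'emotion': EMOTIONS[emotion],
            'level': LEVELS[level]}

info = parse_cremad('1091_WSI_NEU_XX.wav')
print(info)  # {'actor': 1091, 'sentence': 'WSI', 'emotion': 'neutral', 'level': 'unspecified'}
```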

Audio Feature Extraction:

The extracted Mel-spectrograms and windows are in the directory DIR : Emotion_Data/cremad/split/
Apart from the audio files directory in "SER/Emotion_Data/cremad/AudioWAV/", there is another folder containing the pre-processed spectrograms as images in Emotion_Data/cremad/split/. The script divides the data into train, test, and validation sets and saves the images in the respective directories.

Examples of the pre-saved Mel-spectrograms and windowed Mel-spectrograms are below:

Transcripts and Text Preprocessing

The CREMA-D dataset is ideal for testing the efficacy of the model, as it consists of videos and audio for a much wider range of participants (91 actors) with a diverse range of accents, and the data is crowd-sourced, making it prone to external noise. However, there is the hypothesis that real-life emotions are not similar to acted emotions.
The sentences are mostly scripted; the same phrases are spoken with different emotions, so the text extracted in this case is not useful.

CMU-MOSEI

(CMU Multimodal Opinion Sentiment and Emotion Intensity)

License: Copyright 2018 The CMU-MultimodalSDK Contributers

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. FURTHERMORE, IN NO EVENT SHALL THE CREATORS OF THE SOFTWARE BE RESPONSIBLE FOR CONSUMER ABUSE (COPYRIGHT VIOLATION, SUBJECT COMPLAINTS) OF DATASETS BOTH STANDARD AND NON-STANDARD DATASETS. THE DATASETS AND MODELS ARE PROVIDED AS IS. USERS ARE RESPONSIBLE FOR PROPER USAGE OF DATASETS AND MODELS INCLUDING RIGHTS OF SUBJECTS WITHIN THE VIDEOS AS WELL AS PROPER CONDUCT WHEN IT COMES TO PATENTED MODELS, DATASETS OR SCIENTIFIC FINDINGS (THROUGH CITING PATENTS OR SCIENTIFIC PAPERS).

Data Description

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence level sentiment analysis and emotion recognition in online videos. CMU-MOSEI contains more than 65 hours of annotated video from more than 1000 speakers and 250 topics. Each video segment contains manual transcription aligned with audio to phoneme level.

All the videos are gathered from online video sharing websites.

The dataset is currently part of the CMU Multimodal Data SDK and is freely available to the scientific community through GitHub. The dataset was introduced at the Association for Computational Linguistics (ACL) 2018 conference and used in the co-located First Grand Challenge and Workshop on Human Multimodal Language.

Download data

Scripts to download CMU data i.e features from the server :

_Shreyah_code/MOSEI/Preprocessing/data/Downloaddata.ipynb

The script downloads the data and stores them in the directory : _Shreyah_code/MOSEI/Preprocessing/Preprocesseddata

The CMU-Multimodal SDK provides tools to easily load well-known multimodal datasets and rapidly build neural multimodal deep models. The SDK comprises two modules: 1) mmdatasdk: a module for downloading and processing multimodal datasets using computational sequences; 2) mmmodelsdk: tools to utilize complex neural models as well as layers for building new models. The fusion models in prior papers will be released here.

All the datasets here are processed using the SDK (even the old_processed_data folder which uses SDK V0).

The following link contains the word-aligned data and models to run experiments on (without any new alignment techniques)

Data: http://immortal.multicomp.cs.cmu.edu/raw_datasets/processed_data/

Old preprocessed datasets used in the original papers are available at: http://immortal.multicomp.cs.cmu.edu/raw_datasets/processed_data/


Data Distribution

A single audio clip can exhibit multiple emotions simultaneously, with up to 6 emotions (happy, angry, sad, disgust, fear, and surprise) co-existing. However, the emotions are heavily imbalanced in the data, biasing the model towards predicting emotions from the majority emotion classes.
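Multi-label annotations like these are typically encoded as multi-hot vectors (a minimal sketch; the emotion ordering here is an assumption for illustration only):

```python
EMOTIONS = ['happy', 'angry', 'sad', 'disgust', 'fear', 'surprise']

def multi_hot(active):
    """Encode a set of co-existing emotions as a 0/1 vector over EMOTIONS."""
    return [1 if e in active else 0 for e in EMOTIONS]

# A clip annotated with three emotions at once:
label = multi_hot({'happy', 'sad', 'angry'})
print(label)  # [1, 1, 1, 0, 0, 0]
```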

Fig [4.5.2.b]
Fig [4.5.2.c]

Fig [4.5.2.b] depicts the distribution of emotions. The graph shows that multiple emotions co-exist, i.e., a single audio clip can exhibit up to 6 emotions simultaneously.

Fig [4.5.2.c] depicts the table containing the CMU data and its associated emotions after preprocessing. The file -3g5yACwYnA[2] can be seen in dataframe form showing three emotions, happy, sad, and anger, at the same time.

The audio playback of the file -3g5yACwYnA[2] is referenced in the cell below:

In [19]:
audio_loc= 'Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/train/-3g5yACwYnA[2].wav'
load_audio(audio_loc,SR=8000)
MEL spectrogram shape:  (128, 737)
The Original Audio can be found in path :
 Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/train/-3g5yACwYnA[2].wav
Out[19]:

CMU-MOSEI contains around 65 hours of effective audio data. The challenges with the dataset are the annotation format, where multiple emotions can exist at the same time, and the imbalance of the emotion classes. The emotion distribution is seen in the image below:

Fig [5.b]
Fig [5.C]: the graph shows that the "Happy" emotion occurs far more often than emotions like Fear and Surprise.
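One common way to counter this imbalance during training is to weight each class inversely to its frequency (e.g. via the `class_weight` argument in Keras). A minimal sketch, using hypothetical per-emotion counts rather than the dataset's exact numbers:

```python
import numpy as np

def inverse_frequency_weights(counts):
    """Weight each class inversely to its frequency (normalized so the
    average weight is 1), to counter imbalanced emotion labels."""
    counts = np.asarray(counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)
    return {i: float(w) for i, w in enumerate(weights)}

# Hypothetical counts for (happy, sad, anger, disgust, fear, surprise)
counts = [12000, 6000, 5000, 4000, 1900, 2300]
weights = inverse_frequency_weights(counts)
# The rarest class (fear, index 4) receives the largest weight.
```

The resulting dictionary can be passed directly to `model.fit(..., class_weight=weights)` in Keras.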

Pre-processing

Sample of the segmented audio

The CMU-MOSEI dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. In the experiments conducted, the video data is not used.

The transcripts carry the essential information across an entire conversation: speaker identity, start and stop times, dialogue, and emotion labels. The pre-processing scripts segment each utterance based on the speaker start and stop times given in the transcript file.
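The segmentation step can be sketched as slicing the loaded sample array at the transcript's start/stop times (the function name and the transcript row format are illustrative assumptions, not the project's actual script):

```python
import numpy as np

def segment_utterances(samples, sr, rows):
    """Cut a full-conversation waveform into per-utterance segments
    using (speaker, start, stop) times, in seconds, from the transcript."""
    segments = []
    for speaker, start, stop in rows:
        lo, hi = int(start * sr), int(stop * sr)
        segments.append((speaker, samples[lo:hi]))
    return segments

sr = 8000
audio = np.zeros(sr * 10)               # 10 s of placeholder audio
rows = [("F", 0.0, 2.5), ("M", 2.5, 6.0)]
segments = segment_utterances(audio, sr, rows)
# First segment spans 2.5 s -> 20000 samples at 8 kHz.
```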

Fig [xx] below shows an example of an utterance in IEMOCAP: Ses05_impro04.wav refers to improvised utterance 04 of Session 5. The whole utterance in the DIR is segmented with the pre-processing scripts located in DIR.

In [20]:
audio_loc='Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/test/zvZd3V5D5Ik[1].wav'
load_audio(audio_loc,SR=8000)
MEL spectrogram shape:  (128, 769)
The Original Audio can be found in path :
 Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/test/zvZd3V5D5Ik[1].wav
Out[20]:

Data Annotations

Scripts :

The annotation of CMU-MOSEI closely follows that of CMU-MOSI (Zadeh et al., 2016a) and the Stanford Sentiment Treebank (Socher et al., 2013).


Each sentence is annotated for sentiment on a [-3,3] Likert scale: [−3: highly negative, −2: negative, −1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. The Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0,3] Likert scale for the presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x].
The annotation was carried out by 3 crowdsourced judges from the Amazon Mechanical Turk platform. To avoid implicitly biasing the judges and to capture the raw perception of the crowd, extensive annotation training was avoided; instead the judges were given a 5-minute training video on how to use the annotation system. All annotations were carried out only by master workers with a higher than 98% approval rate, to assure high-quality annotations. Figure 2 shows the distribution of sentiment and emotions in the CMU-MOSEI dataset. The distribution shows a slight shift in favor of positive sentiment, similar to the distributions of CMU-MOSI and SST. This reflects an implicit bias of online opinions towards the positive, which is also present in CMU-MOSI.
The emotion histogram shows different prevalences for different emotions. The most common category is happiness, with more than 12,000 positive sample points; the least prevalent emotion is fear, with almost 1,900 positive sample points, which is still an acceptable number for machine learning studies.
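Since each emotion is rated on a [0,3] presence scale, turning the ratings into multi-label training targets amounts to thresholding each score. A sketch (the threshold value and field names are assumptions):

```python
EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

def to_multi_hot(scores, threshold=1):
    """Map per-emotion Likert scores (0: no evidence .. 3: highly present)
    to a binary multi-label vector: 1 if there is any evidence of it."""
    return [1 if scores.get(e, 0) >= threshold else 0 for e in EMOTIONS]

# An utterance rated both weakly happy and sad at once:
labels = to_multi_hot({"happiness": 1, "sadness": 2})
# -> [1, 1, 0, 0, 0, 0]: two emotions co-exist in one clip.
```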

Audio Feature Extraction:

The extracted Mel-spectrograms and windows are stored in the directory Emotion_Data/CMU_MOSEI/Spectrograms/Windows/. Apart from the audio files directory "Emotion_Data/CMU_MOSEI/Raw/Audio/Full", this folder contains the pre-processed spectrograms saved as windows. As the audio files can range from 30 seconds to a couple of minutes, they have to be windowed; the windows were taken with a hop length of 2 seconds and a window frame size of 4 seconds. The script divides the data into train, test, and validation sets and saves the images in the respective directories.
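The windowing itself can be sketched as slicing the (n_mels, T) Mel-spectrogram along its time axis with a 4 s frame and 2 s hop; the frames-per-second value below is an assumption, as it depends on the STFT hop length used when the spectrogram was computed:

```python
import numpy as np

def window_spectrogram(mel, frames_per_sec, win_sec=4, hop_sec=2):
    """Slice a (n_mels, T) Mel-spectrogram into fixed-size windows
    with a 4 s frame and a 2 s hop, as described for CMU-MOSEI."""
    win, hop = win_sec * frames_per_sec, hop_sec * frames_per_sec
    return [mel[:, s:s + win] for s in range(0, mel.shape[1] - win + 1, hop)]

mel = np.random.rand(128, 100)          # e.g. 10 s at 10 frames/s
windows = window_spectrogram(mel, frames_per_sec=10)
# Each window has shape (128, 40) and starts every 20 frames.
```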

An example of some of the pre-saved Mel-spectrograms and windowed Mel-spectrograms is shown below:

Transcripts and Text Preprocessing

Text Preprocessing:

All videos have manual transcriptions. GloVe word embeddings (Pennington et al., 2014) are used to extract word vectors from the transcripts. Words and audio are aligned at the phoneme level using the P2FA forced alignment model (Yuan and Liberman, 2008). Following this, the acoustic modality is aligned to the words by interpolation. Since the duration of words in English is usually short, this interpolation does not lead to substantial information loss.

Scripts :

Scripts to correlate audio length with the text from the transcripts. Here we find the word-length distribution for the audio files (.wav) in _Emotion_Data/CMU_MOSEI/Raw/Audio/Full/WAV16000/ per segment and check whether the audio duration and the words spoken correlate: _Shreyah_code/MOSEI/Preprocessing/Scripts/audio_textcorrelation.ipynb
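The idea behind that check can be sketched as comparing each segment's duration with its transcript word count and measuring how strongly the two correlate (toy data below; the actual notebook reads durations from the WAV files):

```python
import numpy as np

def duration_wordcount_correlation(segments):
    """Pearson correlation between segment duration (seconds) and the
    number of words in its transcript text."""
    durations = [d for d, _ in segments]
    counts = [len(text.split()) for _, text in segments]
    return float(np.corrcoef(durations, counts)[0, 1])

segments = [(1.0, "hi there"),
            (2.0, "how are you doing"),
            (3.0, "this is a longer utterance here")]
r = duration_wordcount_correlation(segments)
# Longer segments carry more words here, so r is close to 1.
```

A low correlation would flag segments whose transcript does not match the audio, e.g. due to misaligned start/stop times.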

Models and Results:

The following deep learning models were explored with the datasets mentioned, and their results are demonstrated below:

    1. Fusion
      • AlexNet + BERT
      • Wav2Vec2 + BERT

Audio

Resnet

Model Directory : SER/Models/Audio/Pretrained_models

Result




  • TESS:

    • With early stopping, for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/TESS/TESS-RESNET50.ipynb

      Classification report

                       precision    recall  f1-score   support
      
               happy       0.67      0.80      0.73       140
             disgust       0.56      0.68      0.61       140
             neutral       0.62      0.64      0.63       140
                 sad       0.67      0.56      0.61       140
                fear       0.79      0.77      0.78       120
               angry       0.59      0.45      0.51       140
      
            accuracy                           0.65       820
           macro avg       0.65      0.65      0.65       820
        weighted avg       0.65      0.65      0.64       820
  • IEMOCAP:

    • With early stopping, for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_resnet50.ipynb

        Classification report
                       precision    recall  f1-score   support
      
                 fru       0.51      0.51      0.51       170
                 exc       0.53      0.21      0.30       300
                 sad       0.35      0.42      0.38       381
                 neu       0.46      0.65      0.54       384
                 ang       0.68      0.51      0.58       245
      
            accuracy                           0.46      1480
           macro avg       0.50      0.46      0.46      1480
        weighted avg       0.49      0.46      0.45      1480
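The per-class numbers in these reports follow scikit-learn's `classification_report` layout. As a small illustrative sketch (toy labels, not the project's outputs), precision and recall per class reduce to simple counts:

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision = TP/(TP+FP) and recall = TP/(TP+FN) --
    the quantities tabulated in the classification reports above."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

y_true = ["sad", "sad", "neu", "ang"]
y_pred = ["sad", "neu", "neu", "ang"]
p, r = precision_recall(y_true, y_pred, "sad")
# precision 1.0 (nothing wrongly labeled 'sad'), recall 0.5 (one 'sad' missed)
```

The "macro avg" row averages these values over classes without weighting, while "weighted avg" weights each class by its support.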

DenseNet

Model Directory : SER/Models/Audio/Pretrained_models/RAVDESS

DenseNet-Result

  • CREMA-D :

    • Early-stopping : Shreyah_code/Pretrained/src/CREMA/CREMA-6emo_DENSENET201.ipynb

      Classification report

                   precision    recall  f1-score   support
      
           happy       0.67      0.68      0.67       140
         disgust       0.55      0.51      0.53       140
         neutral       0.56      0.44      0.49       140
             sad       0.57      0.53      0.55       140
            fear       0.78      0.47      0.58       120
           angry       0.42      0.72      0.53       140
      
        accuracy                           0.56       820
       macro avg       0.59      0.56      0.56       820
    weighted avg       0.59      0.56      0.56       820

  • RAVDESS :

    • With early stopping and 8 emotions with FastAi: Shreyah_code/Pretrained/src/RAVDESS/RAVDESS-Emotions-Densenet.ipynb

                Classification report
                               precision    recall  f1-score   support
      
                   surprised       0.77      0.83      0.80        24
                       happy       0.68      0.54      0.60        24
                     fearful       0.52      0.46      0.49        24
                     disgust       0.78      0.88      0.82        24
                     neutral       0.68      0.79      0.73        24
                         sad       0.42      0.46      0.44        24
                        calm       0.50      0.58      0.54        12
                       angry       0.84      0.67      0.74        24
      
                    accuracy                           0.66       180
                   macro avg       0.65      0.65      0.65       180
                weighted avg       0.66      0.66      0.65       180
    • With windowing and post-processing, 8 emotions with FastAi: Shreyah_code/Pretrained/src/RAVDESS/RAVDESS_Emotions-DENSENET201_windowing-post_processing.ipynb

                Classification report
                               precision    recall  f1-score   support
      
                   surprised       0.75      0.56      0.64        16
                       happy       0.67      0.88      0.76        16
                     fearful       0.57      0.81      0.67        16
                     disgust       0.63      0.75      0.69        16
                     neutral       0.58      0.44      0.50        16
                         sad       0.00      0.00      0.00         8
                        calm       0.00      0.00      0.00         0
                       angry       0.45      0.31      0.37        16
                        none       0.85      0.69      0.76        16
      
                    accuracy                           0.59       120
                   macro avg       0.50      0.49      0.49       120
                weighted avg       0.60      0.59      0.58       120
  • IEMOCAP:

    • With early stopping and 3 emotions with FastAi:

Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_DESNET_3lb.ipnb

ResNeXt

Model Directory : SER/Models/Audio/Pretrained_models

ResNeXT-Result

  • CREMA-D :

    • With early stopping and 3 emotions (FastAi code) : Shreyah_code/Pretrained/src/CREMA/CREMA-Resnext-3lb.ipynb

      • Classification report

                       precision    recall  f1-score   support
        
               happy       0.76      0.91      0.82       140
             neutral       0.85      0.74      0.79       140
               angry       0.95      0.87      0.90       120
        
            accuracy                           0.84       400
           macro avg       0.85      0.84      0.84       400
        weighted avg       0.85      0.84      0.84       400
      • With windowing for 6 emotions (FastAi code) and post-processing: Shreyah_code/Pretrained/src/CREMA/CREMA-windows_ResNext50.ipynb

      • Classification report

                       precision    recall  f1-score   support
        
               happy       0.18      0.48      0.26       789
             disgust       0.40      0.39      0.40      3452
             neutral       0.34      0.18      0.23      2873
                 sad       0.59      0.31      0.41      2473
                fear       0.38      0.44      0.41      2503
               angry       0.37      0.46      0.41      3333
        
            accuracy                           0.37     15423
           macro avg       0.37      0.38      0.35     15423
        weighted avg       0.40      0.37      0.36     15423
  • RAVDESS :

  • TESS:

    • With early stopping, windowing for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/TESS/TESS-Windowing.ipynb

       Classification report
                      precision    recall  f1-score   support
      
              happy       1.00      1.00      1.00        44
            disgust       1.00      1.00      1.00        43
            neutral       1.00      1.00      1.00        43
                sad       0.98      0.98      0.98        42
                 ps       1.00      1.00      1.00        43
               fear       1.00      0.98      0.99        42
              angry       0.98      1.00      0.99        43
      
           accuracy                           0.99       300
          macro avg       0.99      0.99      0.99       300
       weighted avg       0.99      0.99      0.99       300
  • IEMOCAP :

    • With early stopping and 3 emotions (FastAi code): Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_RESNEXT_3lb.ipynb

               Classification report
                      precision    recall  f1-score   support
      
                NEU       0.69      0.79      0.74       170
                HAP       0.47      0.14      0.22       143
                ANG       0.71      0.85      0.78       384
      
           accuracy                           0.69       697
          macro avg       0.62      0.60      0.58       697
       weighted avg       0.66      0.69      0.65       697
      
      
      • With early stopping and 8 emotions (FastAi code): Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_resnext.ipynb

          Classification report
                         precision    recall  f1-score   support
        
                   ang       0.54      0.47      0.50       170
                   exc       0.53      0.39      0.45       300
                   fru       0.47      0.50      0.49       381
                   neu       0.51      0.72      0.60       384
                   sad       0.70      0.47      0.57       245
        
              accuracy                           0.53      1480
             macro avg       0.55      0.51      0.52      1480
          weighted avg       0.54      0.53      0.52      1480

Alexnet

Model Directory : SER/Models/Audio/Attention and Alexnet/IEMOCAP/Alexnet/

ALEXNET-Result

CNN

Convolutional Neural Network

2DCNN-RAVDESS

2DCNN_RESULTS_MODEL1
  • RAVDESS:

    • 2D CNN Model 1:

      • With early stopping and 6 emotions with Model 1 (2 layers) (Keras code): Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_ft-2dcnn_model1.ipynb


        Classification report for Set1

                 precision    recall  f1-score   support
        
             angry       0.41      0.31      0.35        64
           disgust       0.48      0.69      0.57        64
              fear       0.48      0.55      0.51        64
             happy       0.30      0.27      0.28        64
           neutral       0.39      0.47      0.43        32
               sad       0.30      0.20      0.24        64
        
          accuracy                           0.41       352
         macro avg       0.39      0.41      0.40       352
      weighted avg       0.39      0.41      0.39       352
        
        

        Classification report for Set2

                     precision    recall  f1-score   support
        
             angry       0.92      0.89      0.90        64
           disgust       0.92      0.86      0.89        64
              fear       0.84      0.92      0.88        64
             happy       0.92      0.88      0.90        64
           neutral       0.72      0.88      0.79        32
               sad       0.92      0.86      0.89        64
        
          accuracy                           0.88       352
         macro avg       0.87      0.88      0.87       352
      weighted avg       0.89      0.88      0.88       352

        Classification report for Set3

                     precision    recall  f1-score   support
        
             angry       0.84      0.84      0.84        64
           disgust       0.85      0.94      0.89        64
              fear       0.91      0.91      0.91        64
             happy       0.79      0.72      0.75        64
           neutral       0.81      0.91      0.85        32
               sad       0.93      0.86      0.89        64
        
          accuracy                           0.86       352
         macro avg       0.85      0.86      0.86       352
      weighted avg       0.86      0.86      0.86       352

        Classification report for Set4

                     precision    recall  f1-score   support
        
                     angry       0.56      0.56      0.56        64
                   disgust       0.64      0.81      0.72        64
                      fear       0.77      0.72      0.74        64
                     happy       0.59      0.56      0.58        64
                   neutral       0.59      0.72      0.65        32
                       sad       0.68      0.50      0.58        64
        
                  accuracy                           0.64       352
                 macro avg       0.64      0.65      0.64       352
              weighted avg       0.64      0.64      0.64       352
        
        

        Classification report for Set5

                                  precision    recall  f1-score   support
        
                 angry       0.62      0.52      0.56        64
               disgust       0.80      0.73      0.76        64
                  fear       0.59      0.84      0.70        64
                 happy       0.50      0.50      0.50        64
               neutral       0.66      0.66      0.66        32
                   sad       0.72      0.59      0.65        64
        
              accuracy                           0.64       352
             macro avg       0.65      0.64      0.64       352
          weighted avg       0.65      0.64      0.64       352
        
        

        Classification report for Set6

                         precision    recall  f1-score   support
        
                 angry       0.71      0.64      0.67        64
               disgust       0.55      0.83      0.66        64
                  fear       0.80      0.61      0.69        64
                 happy       0.65      0.53      0.59        64
               neutral       0.48      0.75      0.59        32
                   sad       0.64      0.47      0.54        64
        
              accuracy                           0.63       352
             macro avg       0.64      0.64      0.62       352
          weighted avg       0.65      0.63      0.63       352
2DCNN_RESULTS_MODEL2
  • 2D CNN Model 2:

      • With early stopping and 6 emotions, Model 2 (6 layers) (Keras code):
    

    Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_ft-2dcnn_model2.ipynb

                   Classification report for Set1
                         precision    recall  f1-score   support
    
                 angry       0.41      0.31      0.35        64
               disgust       0.48      0.69      0.57        64
                  fear       0.48      0.55      0.51        64
                 happy       0.30      0.27      0.28        64
               neutral       0.39      0.47      0.43        32
                   sad       0.30      0.20      0.24        64
    
              accuracy                           0.41       352
             macro avg       0.39      0.41      0.40       352
          weighted avg       0.39      0.41      0.39       352
    

          Classification report for Set2
                         precision    recall  f1-score   support
    
                 angry       0.92      0.89      0.90        64
               disgust       0.92      0.86      0.89        64
                  fear       0.84      0.92      0.88        64
                 happy       0.92      0.88      0.90        64
               neutral       0.72      0.88      0.79        32
                   sad       0.92      0.86      0.89        64
    
              accuracy                           0.88       352
             macro avg       0.87      0.88      0.87       352
          weighted avg       0.89      0.88      0.88       352
    

      Classification report for Set3
                     precision    recall  f1-score   support
    
             angry       0.84      0.84      0.84        64
           disgust       0.85      0.94      0.89        64
              fear       0.91      0.91      0.91        64
             happy       0.79      0.72      0.75        64
           neutral       0.81      0.91      0.85        32
               sad       0.93      0.86      0.89        64
    
          accuracy                           0.86       352
         macro avg       0.85      0.86      0.86       352
      weighted avg       0.86      0.86      0.86       352
    

      Classification report for Set4
                     precision    recall  f1-score   support
    
             angry       0.56      0.56      0.56        64
           disgust       0.64      0.81      0.72        64
              fear       0.77      0.72      0.74        64
             happy       0.59      0.56      0.58        64
           neutral       0.59      0.72      0.65        32
               sad       0.68      0.50      0.58        64
    
           accuracy                           0.64       352
          macro avg       0.64      0.65      0.64       352
       weighted avg       0.64      0.64      0.64       352
    

      Classification report for Set5
                     precision    recall  f1-score   support
    
             angry       0.62      0.52      0.56        64
           disgust       0.80      0.73      0.76        64
              fear       0.59      0.84      0.70        64
             happy       0.50      0.50      0.50        64
           neutral       0.66      0.66      0.66        32
               sad       0.72      0.59      0.65        64
    
          accuracy                           0.64       352
         macro avg       0.65      0.64      0.64       352
      weighted avg       0.65      0.64      0.64       352
    

      Classification report for Set6
                     precision    recall  f1-score   support
    
             angry       0.71      0.64      0.67        64
           disgust       0.55      0.83      0.66        64
              fear       0.80      0.61      0.69        64
             happy       0.65      0.53      0.59        64
           neutral       0.48      0.75      0.59        32
               sad       0.64      0.47      0.54        64
    
          accuracy                           0.63       352
         macro avg       0.64      0.64      0.62       352
      weighted avg       0.65      0.63      0.63       352

1DCNN-RAVDESS

1DCNN_RESULTS_MODEL1,2,3
  • RAVDESS:

    • 1D CNN Model 1:

      • With early stopping and 6 emotions, 1D CNN Models 1-3 (3 layers, 5 layers, 8 layers) (Keras code): Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_mod_ft_1d%2C358.ipynb

        Model 1 Set 1
                     precision    recall  f1-score   support
        
             angry       0.90      0.94      0.92        64
           disgust       0.90      0.86      0.88        64
              fear       0.93      0.80      0.86        64
             happy       0.87      0.95      0.91        64
           neutral       0.85      0.91      0.88        32
               sad       0.86      0.88      0.87        64
        
           accuracy                           0.89       352
          macro avg       0.89      0.89      0.89       352
       weighted avg       0.89      0.89      0.89       352

           Model 2 Set 1
                         precision    recall  f1-score   support
        
                 angry       0.91      0.97      0.94        64
               disgust       0.88      0.89      0.88        64
                  fear       0.96      0.80      0.87        64
                 happy       0.89      0.91      0.90        64
               neutral       0.90      0.88      0.89        32
                   sad       0.87      0.95      0.91        64
        
              accuracy                           0.90       352
             macro avg       0.90      0.90      0.90       352
          weighted avg       0.90      0.90      0.90       352
        

           Model 3 Set 1
                         precision    recall  f1-score   support
        
                 angry       0.87      0.94      0.90        64
               disgust       0.88      0.83      0.85        64
                  fear       0.98      0.62      0.76        64
                 happy       0.71      0.92      0.80        64
               neutral       0.49      0.97      0.65        32
                   sad       0.94      0.53      0.68        64
        
              accuracy                           0.79       352
             macro avg       0.81      0.80      0.78       352
          weighted avg       0.84      0.79      0.79       352
1DCNN_RESULTS_MODEL1,2,3 for mean of features in Time domain
  • RAVDESS:

    • 1D CNN Model 1:

      • With early stopping and 6 emotions with Model 1

        • 13 features
        • 1D CNN(3 layers, 5 layers, 8 layers)(Keras code): Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label_time_13-Copy1.ipynb

                          Model 1 Set 1
                       precision    recall  f1-score   support
          
               angry       0.57      0.52      0.54        64
             disgust       0.64      0.55      0.59        64
                fear       0.64      0.56      0.60        64
               happy       0.43      0.58      0.49        64
             neutral       0.36      0.50      0.42        32
                 sad       0.65      0.53      0.59        64
          
             accuracy                           0.54       352
            macro avg       0.55      0.54      0.54       352
         weighted avg       0.57      0.54      0.55       352

                         Model 2 Set 1
                       precision    recall  f1-score   support
          
               angry       0.53      0.52      0.52        64
             disgust       0.51      0.66      0.58        64
                fear       0.51      0.48      0.50        64
               happy       0.43      0.50      0.46        64
             neutral       0.48      0.34      0.40        32
                 sad       0.45      0.34      0.39        64
          
             accuracy                           0.49       352
            macro avg       0.48      0.47      0.47       352
         weighted avg       0.48      0.49      0.48       352

                          Model 3 Set 1
                       precision    recall  f1-score   support
          
               angry       0.48      0.50      0.49        64
             disgust       0.54      0.66      0.59        64
                fear       0.67      0.45      0.54        64
               happy       0.56      0.50      0.53        64
             neutral       0.52      0.47      0.49        32
                 sad       0.54      0.66      0.59        64
          
             accuracy                           0.55       352
            macro avg       0.55      0.54      0.54       352
         weighted avg       0.55      0.55      0.54       352

1DCNN_RESULTS_MODEL1,2,3 for mean of features in Freq domain
  • RAVDESS:

    • 1D CNN Model 1:

      • With early stopping and 6 emotions with Model 1
      • 13 features
      • 1D CNN(3 layers, 5 layers, 8 layers)(Keras code): Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label13_ft_1d.ipynb

                                     Model 1 Set 1
        
                          precision    recall  f1-score   support
        
                     angry       0.92      0.92      0.92        64
                   disgust       0.88      0.91      0.89        64
                      fear       0.90      0.81      0.85        64
                     happy       0.84      0.88      0.85        64
                   neutral       0.87      0.81      0.84        32
                       sad       0.87      0.91      0.89        64
        
                  accuracy                           0.88       352
                 macro avg       0.88      0.87      0.87       352
              weighted avg       0.88      0.88      0.88       352
        

                        Model 2 Set 1
                             precision    recall  f1-score   support
        
                     angry       0.86      0.94      0.90        64
                   disgust       0.91      0.83      0.87        64
                      fear       0.92      0.89      0.90        64
                     happy       0.92      0.95      0.94        64
                   neutral       0.88      0.88      0.88        32
                       sad       0.94      0.94      0.94        64
        
                  accuracy                           0.91       352
                 macro avg       0.90      0.90      0.90       352
              weighted avg       0.91      0.91      0.91       352
        

                            Model 3 Set 1
                         precision    recall  f1-score   support
        
                 angry       0.94      0.92      0.93        64
               disgust       0.90      0.88      0.89        64
                  fear       0.92      0.88      0.90        64
                 happy       0.84      0.98      0.91        64
               neutral       0.97      0.91      0.94        32
                   sad       0.98      0.94      0.96        64
        
              accuracy                           0.92       352
             macro avg       0.92      0.92      0.92       352
          weighted avg       0.92      0.92      0.92       352

Attention

CNN+LSTM

Model Directory : SER/Models/Audio/Attention and Alexnet/IEMOCAP/Attention/

Result

CNN+LSTM+Attention

Result

Transformer Based - Wav2Vec2 Fine-tuning

Wav2Vec 2.0 is a pretrained model for Automatic Speech Recognition (ASR) and was released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after the superior performance of Wav2Vec2 was demonstrated on the English ASR dataset LibriSpeech, Facebook AI presented XLSR-Wav2Vec2. XLSR stands for cross-lingual speech representations and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similarly to BERT's masked language modeling, the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

wav2vec2_structure

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official paper.

Preprocessing :

In order to preprocess the audio for our classification model, we need to set up the relevant Wav2Vec2 assets for our language. Since most of the available data is English, the fine-tuned model jonatasgrosman/wav2vec2-large-xlsr-53-english (fine-tuned by jonatasgrosman) is used. To handle the context representations for any audio length, we use a merge strategy (pooling mode) to reduce the 3D representations to 2D representations.
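The pooling step can be sketched as collapsing the time axis of the Wav2Vec2 hidden states so that clips of any length yield a fixed-size vector. Mean pooling is shown below on random arrays standing in for model outputs; the (batch, time, features) shape convention and the 1024-dim feature size are assumptions matching the large XLSR model:

```python
import numpy as np

def pool_hidden_states(hidden, mode="mean"):
    """Collapse (batch, time, features) context representations into
    (batch, features), so variable-length audio maps to a fixed size."""
    if mode == "mean":
        return hidden.mean(axis=1)
    if mode == "max":
        return hidden.max(axis=1)
    raise ValueError(f"unknown pooling mode: {mode}")

hidden = np.random.rand(2, 499, 1024)   # two clips, 1024-dim features
pooled = pool_hidden_states(hidden)
# pooled.shape == (2, 1024), regardless of the time dimension
```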

So far, we have downloaded, loaded, and split the SER dataset into train and test sets, and instantiated our strategy configuration for using context representations in the SER classification problem. Now, we need to extract features from the audio paths as context-representation tensors and feed them into our classification model to determine the emotion in the speech.

Since the audio files are saved in the .wav format, they are easy to load with Librosa or alternatives such as Torchaudio.

An audio file usually stores both its sample values and the sampling rate with which the speech signal was digitized. We want to store both in the dataset, so we write a map(...) function accordingly. We also need to convert the string labels into integers. Here the task is single-label classification; the same setup can be adapted for regression or multi-label classification.
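A sketch of such a map(...) function, assuming a dataset with hypothetical `path` and `emotion` columns and the four-emotion label set used in the IEMOCAP experiments below:

```python
# Hypothetical label set matching this section's 4-class IEMOCAP setup.
labels = sorted(["angry", "happy", "neutral", "sad"])
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

def speech_file_to_array_fn(batch):
    """map(...)-style function: keep both the waveform and its sampling
    rate in the dataset, and turn the string label into an integer."""
    import torchaudio  # librosa.load(batch["path"], sr=None) works just as well
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    batch["label"] = label2id[batch["emotion"]]
    return batch

print(label2id)  # {'angry': 0, 'happy': 1, 'neutral': 2, 'sad': 3}
```

The function is then applied over the dataset with `dataset.map(speech_file_to_array_fn)`.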

Model Directory:

  • _SER/Models/Audio/Wav2Vec2/Wav2vec2RAVDESS/
  • _SER/Models/Audio/Wav2Vec2/Wav2Vec2IEMOCAP/
  • _SER/Models/Audio/Wav2Vec2/Wav2Vec2Greek/
  • _SER/Models/Audio/Wav2Vec2/Wav2Vec2CMU/

Weights and Biases Experiment Tracking:

The wandb Python library is used to track the Wav2vec2 experiments. It can be integrated with frameworks like PyTorch or Keras. https://wandb.ai/shreyah/EMOCAP/reports/Shared-panel-21-12-10-08-12-42--VmlldzoxMzI1MTY3

In [28]:
from IPython.display import IFrame  
IFrame("https://wandb.ai/shreyah/EMOCAP/reports/IEMOCAP-results-with-Wav2Vec2-for-10-sets--VmlldzoxMzI1MTY3", width=800, height=650)
Out[28]:

WAV2VEC2-EMOCAP

Result
  • IEMOCAP:

The results obtained from Wav2Vec2 are a clear improvement over previous models such as CNN, AlexNet, and AlexNet + Attention.

The scripts for cross-validation on the other test sets can be found below:


Classification report

                  precision    recall  f1-score   support

           angry       0.71      0.83      0.76        78
           happy       0.84      0.63      0.72       236
         neutral       0.59      0.81      0.69       192
             sad       0.75      0.60      0.67       138

        accuracy                           0.70       644
       macro avg       0.72      0.72      0.71       644
    weighted avg       0.73      0.70      0.70       644

WAV2VEC2-GREEK

Result
  • GREEK:

The scripts for training the Greek emotion data on Wav2Vec2 can be seen below:

WAV2VEC2-RAVDESS

Result
  • RAVDESS:

The scripts for training the RAVDESS emotion data on Wav2Vec2 can be seen below:

  • Results for Wav2vec2 fine-tuning for 8 emotions on the RAVDESS dataset (PyTorch code) on Set1 test data: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/Ravdess_WAV2vec_set1.ipynb

       Classification report
    
                     precision    recall  f1-score   support
    
              angry       0.84      0.81      0.83        32
               calm       0.81      0.69      0.75        32
            disgust       0.91      0.97      0.94        32
               fear       0.72      0.97      0.83        32
              happy       0.62      0.56      0.59        32
            neutral       0.83      0.62      0.71        16
                sad       0.62      0.50      0.55        32
           surprise       0.76      0.91      0.83        32
    
           accuracy                           0.76       240
          macro avg       0.76      0.75      0.75       240
       weighted avg       0.76      0.76      0.76       240
    
       F1: 0.7527456698170306
       acc: 0.7625

WAV2VEC2-CMU

Result
  • CMU:
  • The scripts for training the CMU regression emotion data on Wav2Vec2 can be seen below:

    • Wav2vec2 fine-tuning for CMU data considering multiple labels and performing regression for 6 emotions (PyTorch code): Shreyah_code/MOSEI/Wav2vec/Audio/src/Regression_model-Windowed-4sec.ipynb

      Next, the evaluation metric is defined. There are many pre-defined metrics for classification and regression problems; in this case we use just accuracy for classification and MSE for regression.

      • Threshold: Let's assume we are using 0.5 as the threshold for prediction

                       precision    recall  f1-score   support
        
               happy       0.65      0.64      0.65       746
                 sad       0.47      0.54      0.50       559
               anger       0.59      0.30      0.39       613
            surprise       0.00      0.00      0.00       213
             disgust       0.60      0.55      0.57       545
                fear       0.00      0.00      0.00        76
        
           micro avg       0.58      0.46      0.51      2752
           macro avg       0.38      0.34      0.35      2752
        weighted avg       0.52      0.46      0.48      2752
         samples avg       0.48      0.43      0.43      2752
      • Threshold: Let's assume we are using 0.3 as the threshold for prediction
                      precision    recall  f1-score   support

               happy       0.64      0.64      0.64       746
                 sad       0.47      0.54      0.50       559
               anger       0.60      0.30      0.40       613
            surprise       0.00      0.00      0.00       213
             disgust       0.59      0.53      0.56       545
                fear       0.00      0.00      0.00        76

           micro avg       0.57      0.45      0.51      2752
           macro avg       0.38      0.33      0.35      2752
        weighted avg       0.52      0.45      0.47      2752
         samples avg       0.48      0.43      0.43      2752       
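The thresholding step used above can be sketched as follows, with hypothetical per-emotion scores:

```python
import numpy as np

def predict_multilabel(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn per-emotion regression scores into 0/1 predictions: an emotion
    is predicted present when its score clears the threshold."""
    return (scores >= threshold).astype(int)

# Hypothetical scores for one clip, ordered (happy, sad, anger, surprise, disgust, fear).
scores = np.array([0.72, 0.41, 0.18, 0.05, 0.55, 0.02])
print(predict_multilabel(scores, 0.5))  # [1 0 0 0 1 0]
print(predict_multilabel(scores, 0.3))  # [1 1 0 0 1 0]
```

Lowering the threshold from 0.5 to 0.3 admits more positive predictions per clip, which trades precision for recall, as the two reports above show.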

Result
  • CMU Classification:

    • Wav2vec2 fine-tuning for CMU data considering the emotions HAPPY, ANGRY, SAD, DISGUST, SURPRISE, FEAR and performing classification over 7 emotions, as only audios with a single emotion are considered (PyTorch code): Shreyah_code/MOSEI/Wav2vec/Audio/src/Training_Windowed-CMU_classification_7emotions.ipynb
      • The scripts for training the CMU classification emotion data on Wav2Vec2 can be seen below. This script trains the windowed files on CMU in the directory; the windowed data is stored as PyArrow datasets. For the emotions considered here (HAPPY, ANGRY, SAD, DISGUST, SURPRISE, FEAR) the label must be 1, meaning only single-emotion data points are considered. The results thus obtained consider regression values.

  • Classification report

                        precision    recall  f1-score   support
    
              Anger       0.15      0.49      0.24       503
            Disgust       0.05      0.28      0.08       140
               Fear       0.04      0.15      0.07       102
              Happy       0.60      0.52      0.56      4031
            Neutral       0.48      0.18      0.26      3480
                Sad       0.23      0.31      0.26      1040
           Surprise       0.00      0.04      0.01        49
    
           accuracy                           0.36      9345
          macro avg       0.22      0.28      0.21      9345
       weighted avg       0.47      0.36      0.38      9345
    
    
  • RESULTS BEFORE POST-PROCESSING ON TEST:


         - Classification report
                           precision    recall  f1-score   support

               Anger       0.62      0.09      0.15      1600
               Happy       0.63      0.53      0.58      3462
             Neutral       0.22      0.31      0.26      1324
                 Sad       0.30      0.61      0.40      1361

            accuracy                           0.41      7747
           macro avg       0.44      0.38      0.35      7747
        weighted avg       0.50      0.41      0.40      7747


  • RESULTS AFTER POST-PROCESSING ON TEST:

         - Classification report
                 precision    recall  f1-score   support

       Anger       0.39      0.08      0.13       216
       Happy       0.76      0.60      0.67      1586
     Neutral       0.35      0.33      0.34       691
         Sad       0.21      0.59      0.31       306

    accuracy                           0.49      2799
   macro avg       0.43      0.40      0.36      2799
weighted avg       0.57      0.49      0.51      2799        

Text

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues at Google. Google leverages BERT to better understand user searches.

BERT performs very well on emotion recognition from text because the model reads the text representation from both directions to get a better sense of context and relationships, unlike earlier models that read the data only left to right or right to left. It takes the input sentence and learns its representation bidirectionally. The trained bidirectional transformer language models capture language context and relationships much more accurately.

Pre-processing (BERT)


BERT has a constraint on the maximum length of a sequence after tokenizing: for any BERT model, the maximum sequence length after tokenization is 512. We set a sequence length of 64, since the token-count distribution for the IEMOCAP/CMU datasets falls within this range.

  • The text data from the IEMOCAP dataset (sentences) and the CMU dataset is used.
  • Every sentence extracted is passed to the BERT model and represented as a 768-dimension vector.

Tokenization

Input data needs to be prepared in a special way. BERT uses WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. Two special tokens are introduced in the text:

  • a token [SEP] to separate two sentences, and
  • a classification token [CLS] which is the first token of every tokenized sequence.
  • The authors state that the final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks. The BertTokenizer, available as part of the Hugging Face Transformers library, is used to preprocess the text data and perform the above operations. This results in the following tensors to feed into the model later:
    • input_ids
    • attention_masks
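To illustrate what the tokenizer produces, here is a toy sketch (simplified vocabulary, no WordPiece splitting; the real pipeline calls Hugging Face's BertTokenizer with max_length=64, padding, and truncation):

```python
# Illustrative only: the real pipeline uses Hugging Face's BertTokenizer,
# e.g. tokenizer(text, max_length=64, padding="max_length", truncation=True).
def encode(tokens, vocab, max_length=64):
    """Add [CLS]/[SEP], pad to max_length, and build the attention mask
    (simplified: no WordPiece splitting, no truncation edge cases)."""
    ids = [vocab["[CLS]"]] + [vocab.get(t, vocab["[UNK]"]) for t in tokens] + [vocab["[SEP]"]]
    ids = ids[:max_length]
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [vocab["[PAD]"]] * (max_length - len(ids))
    return {"input_ids": ids, "attention_mask": attention_mask}

# Toy vocabulary; only the special-token ids match BERT's real ones.
vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "i": 1045, "am": 2572, "happy": 3407}
enc = encode(["i", "am", "happy"], vocab)
print(enc["input_ids"][:6])        # [101, 1045, 2572, 3407, 102, 0]
print(sum(enc["attention_mask"]))  # 5
```

The attention mask marks which positions are real tokens (1) versus padding (0), so the model can ignore the padded tail of each 64-token sequence.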

Model:

BERT’s model architecture is a multi-layer bidirectional Transformer encoder. For the fine-tuning on the CMU-MOSEI and IEMOCAP data I used the BERT “base” model, which has 12 Transformer blocks (layers), 12 self-attention heads, and a hidden size of 768.

Model Directory:

  • SER/Models/Text/IEMOCAP
  • SER/Models/Text/CMU

BERT-IEMOCAP:

  • The text data extracted from the IEMOCAP dataset (sentences) is passed to a BERT model:

    • The IEMOCAP scripts to extract sentences are located in: Shreyah_code/IEMOCAP/Audio/Preprocessing_script.ipynb. Running the script produces the files processed_tran.csv and processed_label.txt, which contain the SessionID, the dialogue, and the labels respectively, extracted from the transcripts.

    • Training IEMOCAP text on BERT:

      • The scripts to train the text model with BERT on IEMOCAP for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code): Shreyah_code/IEMOCAP/Text/IEMOCAP-Text_4_emo.ipynb

        • The scripts to train the text model with BERT on IEMOCAP for 3 emotions (ANGRY, SAD, HAPPY) classification (PyTorch code): link
  • Saved Model CHECKPOINT Directory for IEMOCAP CLASSIFICATION: link
Result
  • IEMOCAP Classification:


  • Classification report

                               precision    recall  f1-score   support
    
                       angry       0.52      0.73      0.61        78
           happy and excited       0.83      0.56      0.67       236
                         sad       0.61      0.58      0.59       138
                     neutral       0.55      0.70      0.62       192
    
                    accuracy                           0.63       644
                    macro avg       0.63      0.64      0.62       644
                weighted avg       0.66      0.63      0.63       644
    
    
    • The scripts to train the text model with BERT on IEMOCAP for 3 emotions (ANGRY, SAD, HAPPY) classification (PyTorch code): Shreyah_code/IEMOCAP/Text/IEMOCAP-Text_4_emo.ipynb

      Classification report
                                 precision    recall  f1-score   support
      
                         angry       0.63      0.85      0.72        78
             happy and excited       0.88      0.75      0.81       236
                           sad       0.71      0.75      0.73       138
      
                      accuracy                           0.77       452
                     macro avg       0.74      0.78      0.75       452
                  weighted avg       0.78      0.77      0.77       452

BERT-CMU:

Once loaded into a dataframe, the pre-processed files look like the table below:

  • This text data is cleaned to remove any unicode characters and punctuation. The text data extracted from the CMU data (segmented sentences) is tokenized and passed to a BERT model.
  • As CMU is a multilabel, multiclass dataset, only data points carrying exactly 1 label at a given instance of time are considered. Data points with more than 1 emotion are discarded during training, and data points where no emotion exists are treated as Neutral. This procedure is followed for the CMU data to train BERT for emotion classification.

    • Training CMU text on BERT:

      • The scripts to train the text model with BERT on CMU for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code): Shreyah_code/MOSEI/Text/Emotion/Bert/Text_Preprocessing_Bert_Multiclass_4_emo.ipynb


        Classification report

                            precision    recall  f1-score   support
        
                 Happy       0.63      0.66      0.65      2014
                   Sad       0.32      0.24      0.27       594
               Neutral       0.36      0.47      0.41      1338
                 Anger       0.41      0.12      0.19       537
        
              accuracy                           0.48      4483
             macro avg       0.43      0.37      0.38      4483
          weighted avg       0.48      0.48      0.47      4483
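The single-label filtering described above can be sketched as follows (a hypothetical helper, assuming one score per emotion per segment):

```python
EMOTIONS = ["happy", "anger", "sad", "disgust", "surprise", "fear"]

def to_single_label(scores, threshold=0.0):
    """Keep a segment only if exactly one emotion is active: segments with no
    active emotion become 'neutral'; multi-emotion segments are dropped (None)."""
    active = [emo for emo, s in zip(EMOTIONS, scores) if s > threshold]
    if not active:
        return "neutral"
    if len(active) == 1:
        return active[0]
    return None  # discarded from training

print(to_single_label([0.9, 0, 0, 0, 0, 0]))    # happy
print(to_single_label([0, 0, 0, 0, 0, 0]))      # neutral
print(to_single_label([0.7, 0.3, 0, 0, 0, 0]))  # None (dropped)
```

Dropping the multi-emotion segments turns the multilabel CMU annotations into a standard single-label classification dataset, at the cost of discarding part of the data.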

The results show that the test set performs well only for Happy and Neutral. To improve the results for Sad and Anger, data from other sources such as IEMOCAP and SMILE (shown below) was combined to balance the emotions and to see if there was any improvement on the test set.

  • The script combines IEMOCAP with the CMU data in order to balance the emotion imbalance for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code). After adding the IEMOCAP data, the same CMU test data was used for evaluation. The data distribution after adding IEMOCAP:
    • Happy 6809
    • Neutral 6153
    • Sad 3074
    • Anger 2693

The training data before adding IEMOCAP's Sad and Anger:


Shreyah_code/MOSEI/Text/Emotion/Bert/Text%20Preprocessing-Bert_Multiclass-4%20emotions-Emocap%2BMOSEI%2BSMILE.ipynb

The training data after adding IEMOCAP's and SMILE's Sad and Anger emotions:

  • Sad 8839
  • Happy 6809
  • Neutral 6153
  • Anger 5420

    Classification report

                                      precision    recall  f1-score   support
    
                             Anger       0.40      0.14      0.21       487
                             Happy       0.60      0.77      0.68      1955
                               Sad       0.28      0.10      0.14       539
                           Neutral       0.38      0.42      0.40      1257
    
                          accuracy                           0.51      4238
                         macro avg       0.41      0.36      0.36      4238
                      weighted avg       0.47      0.51      0.47      4238
    
    

RoBERTa

The RoBERTa model (Liu et al., 2019) introduces some key modifications above the BERT MLM (masked-language modeling) training procedure. The authors highlight “the importance of exploring previously unexplored design choices of BERT”. Details of these design choices can be found in the paper’s Experimental Setup section.

RoBERTa is trained on BookCorpus (Zhu et al., 2015), amongst other datasets. A recently published work BerTweet (Nguyen et al., 2020) provides a pre-trained BERT model (using the RoBERTa procedure) on vast Twitter corpora in English. They argue that BerTweet better models the characteristic of language used on the Twitter subspace, outperforming previous SOTA models on Tweet NLP tasks. Hence, it is a good indicator that the performance on downstream tasks is greatly influenced by what our LM captures!

Similarly, RoBERTa has achieved a benchmark on the IEMOCAP data: https://paperswithcode.com/sota/emotion-recognition-in-conversation-on.

Therefore I am using RoBERTa to analyse the improvement on the CMU data.

RoBERTa in the Transformers library

The 🤗 Transformers library comes bundled with classes and utilities to apply various tasks with the RoBERTa model.

RoBERTa-CMU:

Once loaded into a dataframe, the pre-processed files look like the table below:

  • This text data is cleaned to remove any unicode characters and punctuation. The text data extracted from the CMU data (segmented sentences) is tokenized and passed to a RoBERTa model.
  • As CMU is a multilabel, multiclass dataset, only data points carrying exactly 1 label at a given instance of time are considered. Data points with more than 1 emotion are discarded during training, and data points where no emotion exists are treated as Neutral. This procedure is followed for the CMU data to train RoBERTa for emotion classification.

    • Training CMU text on RoBERTa:

      • The scripts to train the text model with RoBERTa on CMU for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch-Lightning code):

        Shreyah_code/MOSEI/Text/Emotion/RobertaBert/Text%20Preprocessing-RobertaBert_4_emotions.ipynb


        Classification report

                        precision    recall  f1-score   support
        
                 Happy     0.6461    0.8613    0.7383      1651
                   Sad     0.2661    0.0971    0.1422       340
                 Anger     0.0000    0.0000    0.0000       227
               Neutral     0.3708    0.3190    0.3429       765
        
              accuracy                         0.5696      2983
             macro avg     0.3208    0.3193    0.3059      2983
          weighted avg     0.4830    0.5696    0.5128      2983

The results show that the test set performs well only for Happy and Neutral; Anger and Sad show poor performance. To improve the results for Sad and Anger, data from other sources such as IEMOCAP and SMILE (shown below) was combined to balance the emotions and to see if there was any improvement on the test set.

  • The script combines IEMOCAP with the CMU data in order to balance the emotion imbalance for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code). After adding the IEMOCAP and SMILE data, the same CMU test data was used for evaluation. The training data before adding IEMOCAP's Sad and Anger:
    • Happy 6809
    • Neutral 4445
    • Sad 1990
    • Anger 1590

Location:


_Shreyah_code/MOSEI/Text/Emotion/RobertaBert/Roberta_4emotions_Emocap_MOSEISMILE.ipynb

The training data after adding IEMOCAP's and SMILE's Sad and Anger emotions:

  • Sad 8823
  • Happy 6809
  • Neutral 6153
  • Anger 5406

    Classification report

                            precision    recall  f1-score   support
    
                     Happy   0.380000  0.078029  0.129472       487
                       Sad   0.589490  0.791816  0.675835      1955
                     Anger   0.290698  0.092764  0.140647       539
                   Neutral   0.365672  0.389817  0.377358      1257
    
                  accuracy                       0.501652      4238
                 macro avg   0.406465  0.338107  0.330828      4238
              weighted avg   0.461031  0.501652  0.456456      4238

Fusion

Multimodal approach

  • The text and audio models are trained separately and used here to collect the embeddings.
  • The embeddings are concatenated and fed to the classification layer.
  • Only the final classification layer is trained.

  • Audio :

    • Every audio segment extracted is passed to the Wav2vec2 / AlexNet model and represented as a 512-dimension / 256-dimension vector.
    • The pre-trained AlexNet-based audio model is saved in location:
  • Text:
  • An example of the fusion mechanism is demonstrated below:
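A minimal sketch of such a fusion head, assuming 512-dimensional Wav2vec2 audio embeddings, 768-dimensional BERT text embeddings, and the 4 IEMOCAP emotion classes (the class and dimension choices are assumptions for illustration):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate the (frozen) audio and text embeddings and train only
    this final classification layer."""
    def __init__(self, audio_dim: int = 512, text_dim: int = 768, num_classes: int = 4):
        super().__init__()
        self.classifier = nn.Linear(audio_dim + text_dim, num_classes)

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 512 + 768)
        return self.classifier(fused)

head = FusionHead()
# A batch of 8 utterances: Wav2vec2 audio embeddings + BERT [CLS] text embeddings.
logits = head(torch.randn(8, 512), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```

Since only this linear layer is trained, the fusion step is cheap: the expensive audio and text encoders stay frozen and serve purely as feature extractors.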

Alexnet + Bert

Results

                  Classification report
                               precision    recall  f1-score   support

                       angry       0.57      0.69      0.62        75
                       happy       0.82      0.62      0.70       214
                         sad       0.59      0.69      0.64       136
                     neutral       0.58      0.62      0.60       172

                    accuracy                           0.64       597
                   macro avg       0.64      0.65      0.64       597
                weighted avg       0.66      0.64      0.65       597 

Wav2vec2 + Bert

Results

Classification report

                      precision    recall  f1-score   support

               angry       0.53      0.81      0.64        78
               happy       0.86      0.63      0.73       236
                 sad       0.71      0.73      0.72       138
             neutral       0.62      0.68      0.65       192

            accuracy                           0.69       644
           macro avg       0.68      0.71      0.68       644
        weighted avg       0.72      0.69      0.69       644

Model Directory:

  • _SER/Models/Fusion/FusionWav2Vec2+Bert
  • _SER/Models/Fusion/FusionAlexnet+Bert
